Reviews: Adaptively Aligned Image Captioning via Adaptive Attention Time
Although the two techniques have been well explored individually, this is the first work combining them for attention in image captioning. This should make reproducing the results easier. However, the base attention model already performs much better than up-down attention and recent methods such as GCN-LSTM, so it is not clear where the gains are coming from. It would be good to see AAT applied to traditional single-head attention instead of multi-head attention to show convincingly that AAT helps. An analysis of the learned behavior would also strengthen the paper: for instance, how does the number of attention time steps vary with word position in the caption?
After feedback and reviewer discussion, this paper received final ratings of 6, 7 and 7. Although the novelty of the proposed model is relatively minor in the context of previous work proposing Adaptive Computation Time (Graves 2016), the reviewers were impressed by the empirical performance and praised the detailed ablation studies (including the additional experiments with single-headed attention in the author feedback, which was important in reaching the final consensus view of reviewers to accept this paper). We encourage the authors to follow the suggestion of R1 (cut down space devoted to standard captioning components in Secs 3.2.1,
Adaptively Aligned Image Captioning via Adaptive Attention Time
Recent neural models for image captioning usually employ an encoder-decoder framework with an attention mechanism. However, the attention mechanism in such a framework aligns one single (attended) image feature vector to one caption word, assuming a one-to-one mapping between source image regions and target caption words, which rarely holds. In this paper, we propose a novel attention model, namely Adaptive Attention Time (AAT), to align the source and the target adaptively for image captioning. AAT allows the framework to learn how many attention steps to take to output a caption word at each decoding step. With AAT, an image region can be mapped to an arbitrary number of caption words, while a caption word can also attend to an arbitrary number of image regions.
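The core idea (take a variable number of attention steps per decoding step, gated by a learned halting signal, in the spirit of Adaptive Computation Time) can be sketched as follows. This is a minimal illustration, not the authors' exact architecture: the module names, the GRU-based state refinement, and the halting threshold are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class AdaptiveAttentionTimeSketch(nn.Module):
    """Hedged sketch of AAT-style decoding: at each decoding step, keep
    taking attention steps over image regions until a learned halting
    signal indicates the attended context suffices for the next word."""

    def __init__(self, hidden_size, num_heads=4, max_steps=4, threshold=0.99):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.cell = nn.GRUCell(hidden_size, hidden_size)  # refines the decoder state
        self.halt = nn.Linear(hidden_size, 1)             # per-step halting confidence
        self.max_steps = max_steps
        self.threshold = threshold

    def forward(self, h, image_feats):
        # h: (batch, hidden) decoder state; image_feats: (batch, regions, hidden)
        total_halt = torch.zeros(h.size(0), device=h.device)
        for _ in range(self.max_steps):
            # One attention step: query the image regions with the current state.
            ctx, _ = self.attn(h.unsqueeze(1), image_feats, image_feats)
            h = self.cell(ctx.squeeze(1), h)
            # Accumulate halting confidence; stop once every example is confident.
            total_halt = total_halt + torch.sigmoid(self.halt(h)).squeeze(-1)
            if bool((total_halt >= self.threshold).all()):
                break
        return h  # state used to predict the next caption word
```

Because the number of attention steps can be zero to `max_steps` per word, one region can inform several words and one word can aggregate several regions, which is the adaptive alignment the abstract describes.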
Lun Huang, Wenmin Wang, Yaxian Xia, Jie Chen